A/B/C... Testing with ANOVA

ANOVA for Multiple Groups with Numeric Data

What if you have more than two populations? Instead of an A/B test, we might compare multiple groups (A/B/C/D), all with numeric data.

Analysis of Variance (or ANOVA) is the procedure that tests for statistically significant differences among three or more independent groups. It tests if any of the population means differ from the other population means. In other words, we test if the clusters formed by each individual population are more tightly grouped than the spread across all the populations. The null hypothesis is that all the population means are equal, H0: µ1 = µ2 = µ3.

The alternative hypothesis is that at least one mean is different from the others. Each population is assumed to be normally distributed and of equal variance. We compare the between-groups mean sum of squares to the within-groups mean sum of squares. If the between-groups sum of squares is much larger than the within-groups sum of squares, then some of the population means are different from each other.

The within-groups sum of squares is the sum of squares within each separate group and is calculated as we have seen before. To calculate the between-groups sum of squares, pool all the data points together, calculate a grand mean, and compute the total sum of squares about that mean; the between-groups sum of squares is then this total minus the within-groups sum of squares.

For each sum of squares, the variance is calculated by dividing by its degrees of freedom. The ratio of the between-groups variance to the within-groups variance is called the F-statistic. Thus, the F-statistic measures the ratio of the variance across group means to the variance due to residual error within each group. A large F-statistic indicates that the differences between group means are greater than would be expected due to chance alone, and thus the null hypothesis is rejected.
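
To make the arithmetic concrete, here is a minimal Python sketch of the one-way ANOVA calculation. The group values are invented purely for illustration.

  import numpy as np

  # Three hypothetical groups of equal size (values invented for illustration).
  groups = [np.array([23.0, 19.0, 25.0, 21.0]),
            np.array([30.0, 28.0, 33.0, 29.0]),
            np.array([22.0, 24.0, 20.0, 26.0])]

  k = len(groups)                    # number of groups
  N = sum(len(g) for g in groups)    # total number of data points
  grand_mean = np.concatenate(groups).mean()

  # Between-groups sum of squares: squared deviation of each group mean
  # from the grand mean, weighted by group size.
  ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

  # Within-groups sum of squares: squared deviations inside each group.
  ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

  # Each variance (mean square) is its sum of squares divided by its df.
  ms_between = ss_between / (k - 1)
  ms_within = ss_within / (N - k)

  f_stat = ms_between / ms_within    # large F => reject the null hypothesis
  print(f"F = {f_stat:.2f}")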

Two types of ANOVA are the one-way ANOVA, also called the single-factor ANOVA, and the two-way ANOVA, also called the two-factor ANOVA. The one-way ANOVA uses one independent variable for multiple groups; using a single independent variable assumes that, apart from that variable, the members of the groups share the same characteristics. In the following example, the single factor is what kind of offer was presented to clients on a website; otherwise, the clients are indistinguishable.

A two-way ANOVA divides the groups by two independent variables. In the following example, we could also divide our groups by gender. Thus, we would have groups divided into (1) offer received and (2) gender. The results would give an F-statistic for each variable. We will limit our discussion to the single-factor ANOVA.

ANOVA Procedure in Excel

Consider an example where every customer who visits a website gets one of two promotional offers or no offer. The goal is to see if the offers make a difference. The null hypothesis would be that neither promotional offer makes a difference over no offer or that all means are equal. We could run an ANOVA to test this hypothesis.1

  1. Determine if the population variances are equal with =VAR.S(range). One rule of thumb is that if the ratio of the larger sample variance to the smaller is less than 3:1, then we can assume equal variances.

  2. In Analysis ToolPak, choose Anova: Single Factor. Figure 11.7 shows the ToolPak dialog window.

    Figure 11.7: ToolPak Dialog Window with ANOVA
  3. Enter the range of all the data (include the header labels in the range if desired and check the Labels box). Enter Alpha (usually 0.05 or 0.01) as shown in Figure 11.8. Enter the beginning cell of the output range, leaving room for at least 15 rows and 7 columns. Click OK.

    Figure 11.8: ANOVA Dialog Window
  4. Use the F-statistic and the p-value to interpret the results:

Figure 11.9 shows the results. The SUMMARY table gives the within-group sum, average, and variance. The ANOVA table shows the between-groups sum of squares (SS), the degrees of freedom (df), the mean sum of squares (like the variance; it is SS divided by df), the F value (which is MS Between divided by MS Within), and the p-value. It also shows the F-critical value.

Since the F-statistic is much greater than 1 and above the F crit value (24.69 is greater than 3.2), and the p-value (0.0000000814) is far below our alpha of 0.05, we reject the null hypothesis that all means are equal.

Figure 11.9: Results of ANOVA Test
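
For readers who want to check the ToolPak's numbers outside of Excel, the same single-factor ANOVA can be run in a few lines of Python with SciPy. The response values below are placeholders, not the book's data; substitute the actual spreadsheet columns.

  import numpy as np
  from scipy import stats

  # Placeholder response values for the three treatments.
  no_offer = np.array([210.0, 205.0, 198.0, 215.0, 202.0])
  offer1 = np.array([240.0, 238.0, 245.0, 250.0, 236.0])
  offer2 = np.array([248.0, 252.0, 243.0, 246.0, 255.0])

  # Step 1 rule of thumb: the largest-to-smallest variance ratio
  # should be under about 3:1 to assume equal variances.
  variances = [g.var(ddof=1) for g in (no_offer, offer1, offer2)]
  print("variance ratio:", max(variances) / min(variances))

  # One-way (single-factor) ANOVA; F and p should match the ToolPak output.
  f_stat, p_value = stats.f_oneway(no_offer, offer1, offer2)
  print(f"F = {f_stat:.2f}, p = {p_value:.4g}")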

TukeyHSD Procedure in Excel

Now we know that some means are statistically different from the other means, but we don’t know which means are different. To find out, we calculate the TukeyHSD test (also known as the Tukey-Kramer post hoc test) on all pair-wise tests for difference of means.2 Refer to Figure 11.10 as we walk through the steps to calculate the TukeyHSD; a code sketch that replicates the steps follows the list. There is not a ToolPak process to do this. It is done with calculations in the Excel spreadsheet cells.

Figure 11.10: Calculations for TukeyHSD
  1. Find the absolute mean difference between each pair of groups in the ANOVA.

  2. Find the Q critical value = Q*sqrt(pooledVariance/n), where n is the number of observations per group.

    1. Calculate the pooled variance, which is the average of the group variances (a simple average works because the groups are the same size). For our example: =AVERAGE(J8:J10) = 352.43.

    2. Calculate degrees of freedom, which is N-k, where N is the total number of data points and k is the number of groups. For our example: N=15*3=45, k=3, so df=45-3=42.

    3. Look up the Q value in a Studentized Range Q Table. For our example, in the table where alpha = 0.05 with df = 42 and k=3, Q=3.436. Another resource is found at real-statistics.com.

    4. Calculate Q critical value = Q*sqrt(pooledVariance/n) = 3.436*sqrt(352.43/15) = 16.65.

  3. Determine which group means are different by comparing the absolute mean difference between each pair of groups to the Q critical value. If the difference is larger, then the means are statistically significantly different. For our example, both Offer1 and Offer2 are different from NoOffer, but there is no statistically significant difference between Offer1 and Offer2.
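
The same steps can be scripted. Below is a Python sketch of the TukeyHSD calculation, assuming SciPy 1.7 or later for the studentized range distribution (which replaces the Q table lookup); the data are randomly generated stand-ins for the spreadsheet columns.

  import numpy as np
  from scipy.stats import studentized_range

  # Stand-in data: three groups of 15 observations each, mirroring the
  # example's layout (substitute the real spreadsheet columns).
  rng = np.random.default_rng(0)
  groups = {"NoOffer": rng.normal(200, 19, 15),
            "Offer1": rng.normal(230, 19, 15),
            "Offer2": rng.normal(235, 19, 15)}

  k = len(groups)      # number of groups
  n = 15               # observations per group
  df = k * n - k       # N - k = 42

  # Steps 2a-2b: pooled variance (average of the group variances, since
  # the groups are the same size) and degrees of freedom.
  pooled_var = np.mean([g.var(ddof=1) for g in groups.values()])

  # Step 2c: Q from the studentized range distribution (the table lookup).
  q = studentized_range.ppf(0.95, k, df)

  # Step 2d: Q critical value.
  q_crit = q * np.sqrt(pooled_var / n)

  # Steps 1 and 3: compare each absolute mean difference to Q critical.
  names = list(groups)
  for i in range(k):
      for j in range(i + 1, k):
          diff = abs(groups[names[i]].mean() - groups[names[j]].mean())
          verdict = "different" if diff > q_crit else "not different"
          print(f"{names[i]} vs {names[j]}: |diff| = {diff:.2f} "
                f"(Q crit = {q_crit:.2f}) -> {verdict}")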

Multi-Armed Bandit Algorithm Design

Traditional experimental designs are great, but you should understand that in analytics there is always a push to wring more business value out of everything we do. The multi-armed bandit A/B test design is one such improvement. The traditional A/B test design is focused on statistical significance, but what if an effect is suggested but not statistically significant? Business is often less concerned with statistical significance and more concerned with optimization and taking advantage of results before the experiment is over.

Multi-armed bandit algorithms are a popular method for testing multiple treatments at once and reaching conclusions faster than traditional designs. Think of a Las Vegas slot machine with multiple arms (instead of one) from which a customer may choose, each with a different payoff. This is an analogy for a multi-treatment experiment. The goal is to win as much money as possible by identifying the winning arm as soon as possible without knowing the payoff amounts.

Suppose there are three arms on the machine (A, B, C), and arm A appears to be “winning” more often than the other two. Instead of pulling them all equally, we start pulling A more often. If C starts doing better than A, we start pulling C more often. If A was initially better due to chance, the real winner still has a chance of emerging with further testing.

We use what we learn during the experiment to optimize the results as we go rather than deploying inferior treatments for the entire experiment. Traditional A/B tests’ use of random sampling can lead to excessive exposure to inferior treatments, while multi-armed bandits alter the sampling process to incorporate information learned during the experiment to reduce the frequency of inferior treatments.

In the context of web testing, we could test multiple advertisement offers, colors, or headlines. At first the offers would be shown randomly and equally. If one offer starts to perform better than the others, we can increase the rate it is shown.

There are several algorithms for shifting sampling away from inferior treatments to a superior treatment. The epsilon-greedy algorithm follows these steps:

  1. Generate a random number between 0 and 1.

  2. If the number is less than epsilon (a small number between 0 and 1), flip a fair coin: if heads, show offer A; if tails, show offer B.

  3. If the number is greater than or equal to epsilon, show whichever offer has had the highest response rate so far.

If epsilon is 1, we have a standard A/B test. If it is 0, we have a purely greedy algorithm. Thus, we set epsilon between 0 and 1: the closer to 0, the faster the convergence, but also the greater the chance we miss the “best” offer.
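
Here is a minimal Python simulation of the epsilon-greedy algorithm for two offers. The conversion rates are invented for the simulation; in a real deployment the wins would come from actual click or purchase events.

  import random

  true_rates = {"A": 0.10, "B": 0.14}   # hypothetical conversion rates
  epsilon = 0.1                         # exploration rate

  shows = {"A": 0, "B": 0}
  wins = {"A": 0, "B": 0}

  def best_offer():
      # Offer with the highest observed response rate so far.
      return max(shows, key=lambda o: wins[o] / shows[o] if shows[o] else 0.0)

  for _ in range(10_000):
      if random.random() < epsilon:
          # Explore: flip a fair coin between the offers.
          offer = random.choice(["A", "B"])
      else:
          # Exploit: show the current best performer.
          offer = best_offer()
      shows[offer] += 1
      wins[offer] += random.random() < true_rates[offer]

  for o in shows:
      rate = wins[o] / shows[o] if shows[o] else 0.0
      print(f"{o}: shown {shows[o]} times, observed rate {rate:.3f}")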

Another algorithm uses Thompson sampling, which is a Bayesian approach that updates prior probabilities as information is learned.
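
As a sketch, here is Thompson sampling for the same two hypothetical offers, using a Beta prior on each offer's conversion rate (a common choice for binary outcomes):

  import random

  true_rates = {"A": 0.10, "B": 0.14}   # made-up rates for the simulation
  alpha = {"A": 1, "B": 1}              # successes + 1 (Beta(1, 1) prior)
  beta = {"A": 1, "B": 1}               # failures + 1

  for _ in range(10_000):
      # Sample a plausible conversion rate for each offer from its
      # posterior, then show the offer with the highest sample.
      samples = {o: random.betavariate(alpha[o], beta[o]) for o in true_rates}
      offer = max(samples, key=samples.get)
      if random.random() < true_rates[offer]:
          alpha[offer] += 1             # success updates the posterior
      else:
          beta[offer] += 1

  for o in true_rates:
      shown = alpha[o] + beta[o] - 2
      print(f"{o}: shown {shown} times, posterior mean "
            f"{alpha[o] / (alpha[o] + beta[o]):.3f}")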